Software Defect Prediction for High-Dimensional and Class-Imbalanced Data
Authors
Abstract
Software quality and reliability can be improved using various techniques during the software development process. One effective method is to utilize software metrics and defect data collected during the software development life cycle and build defect predictors using data mining techniques to estimate the quality of target program modules. Such a strategy allows practitioners to intelligently allocate project resources and focus more on the potentially problematic modules. The effectiveness of a defect predictor is influenced, among other factors, by the quality of input data. Two problems that often arise in software measurement and defect data are high dimensionality and class imbalance. This paper presents an approach for using feature selection and data sampling together to address both problems. Three scenarios are considered: 1) feature selection based on sampled data, and modeling based on original data; 2) feature selection based on sampled data, and modeling based on sampled data; and 3) feature selection based on original data, and modeling based on sampled data. Several software measurement data sets, obtained from the PROMISE repository, are used in the case study. The empirical results demonstrate that classification models built in scenario 1) result in significantly better performance than the models built in the other two scenarios.
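The three scenarios differ only in which copy of the data (original or sampled) feeds the feature selector and which feeds the learner. The following is a minimal sketch of that wiring, not the authors' implementation. The abstract does not name the sampling technique or the feature ranker, so random undersampling and a simple class-mean-gap ranking are assumptions made here for illustration. The function names (`undersample`, `rank_features`, `build_training_set`) are likewise hypothetical.

```python
import random
from statistics import mean

def undersample(X, y, seed=0):
    """Random undersampling (an assumed technique): keep all minority-class
    rows and an equally sized random subset of majority-class rows."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

def rank_features(X, y, k):
    """Assumed ranker: score each feature by the absolute gap between its
    mean over defective rows and its mean over non-defective rows, then
    return the indices of the k highest-scoring features."""
    scores = []
    for j in range(len(X[0])):
        m1 = mean(row[j] for row, label in zip(X, y) if label == 1)
        m0 = mean(row[j] for row, label in zip(X, y) if label == 0)
        scores.append((abs(m1 - m0), j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)

def project(X, features):
    """Keep only the selected feature columns."""
    return [[row[j] for j in features] for row in X]

def build_training_set(X, y, scenario, k, seed=0):
    """Return (selected_features, X_train, y_train) for scenarios 1 to 3."""
    Xs, ys = undersample(X, y, seed)
    if scenario == 1:    # select on sampled data, model on original data
        feats = rank_features(Xs, ys, k)
        return feats, project(X, feats), list(y)
    if scenario == 2:    # select on sampled data, model on sampled data
        feats = rank_features(Xs, ys, k)
        return feats, project(Xs, feats), ys
    if scenario == 3:    # select on original data, model on sampled data
        feats = rank_features(X, y, k)
        return feats, project(Xs, feats), ys
    raise ValueError("scenario must be 1, 2, or 3")

# Toy data: 8 non-defective modules, 2 defective; only feature 0 is informative.
X = [[0, 1, 5], [1, 0, 5]] * 4 + [[9, 1, 5], [10, 0, 5]]
y = [0] * 8 + [1] * 2
for s in (1, 2, 3):
    feats, X_train, y_train = build_training_set(X, y, scenario=s, k=1)
    print(s, feats, len(X_train))
```

In scenario 1 the learner still sees every original module, and sampling influences only which metrics are kept. In scenarios 2 and 3 the learner is trained on the reduced, balanced subset. Any classifier can then be fitted on the returned `X_train` and `y_train`.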
Similar Resources
Kernel Based Asymmetric Learning for Software Defect Prediction
Software defect prediction is to predict the defect-prone modules for the next release of software or cross project software. Real world data mining applications, including software defect prediction domain, must address the issue of learning from imbalanced data sets. As pointed out by Khoshgoftaar et al. [1] and Menzies et al. [2], the majority of defects in a software system are located in a...
Towards Cross-Project Defect Prediction with Imbalanced Feature Sets
Cross-project defect prediction (CPDP) has been deemed as an emerging technology of software quality assurance, especially in new or inactive projects, and a few improved methods have been proposed to support better defect prediction. However, the regular CPDP always assumes that the features of training and test data are all identical. Hence, very little is known about whether the method for C...
Using Class Imbalance Learning for Cross-Company Defect Prediction
Cross-company defect prediction (CCDP) is a practical way that trains a prediction model by exploiting one or multiple projects of a source company and then applies the model to target company. Unfortunately, the performance of such CCDP models is susceptible to the high imbalanced nature between the defect-prone and non-defect classes of CC data. Class imbalance learning is applied to alleviat...
A Fuzzy-Evolutionary Method for Software Defect Detection
Software defects detection is one of the most important challenges of software development and it is the most prohibitive process in software development. The early detection of fault-prone modules helps software project managers to allocate the limited cost, time, and effort of developers for testing the defect-prone modules more intensively. In this paper, according to the importance of soft...
Heterogeneous Defect Prediction via Exploiting Correlation Subspace
Software defect prediction generally builds models from intra-project data. Lack of training data at the early stage of software testing limits the efficiency of prediction in practice. Thereby researchers proposed cross-project defect prediction using the data from other projects. Most previous efforts assumed the cross-project defect data have the same metrics set which means the metrics used...
Publication date: 2011